2025-01-06
Assume we have the correct model (this is a big assumption!)
How should we talk about things like uncertainty and error?
How do we make an argument for one model over another?
P-values
Confidence intervals
Frequentist alternatives
Bayesian alternatives
Can this British lady do wild stuff with her tea?
Fisher’s proposed experiment:
8 cups of tea. 4 milk first, and 4 tea first
Each cup presented in random order, and she’s asked to classify each one
The “null hypothesis”: she’s just guessing
The p-value is “the probability of correctly classifying at least N cups by guessing”
At least 1 correct \(= 69/70 = 99\%\)
At least 2 correct \(= 53/70 = 76\%\)
At least 3 correct \(= 17/70 = 24\%\)
4 correct \(= 1/70 = 1.43\%\)
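These are hypergeometric probabilities; a quick sketch of the calculation in base R:

```r
# P(at least n of her 4 picks are truly milk-first) under pure guessing:
# 4 milk-first cups, 4 tea-first cups, 4 cups picked at random
sapply(1:4, function(n) sum(dhyper(n:4, m = 4, n = 4, k = 4)))
# 0.9857 0.7571 0.2429 0.0143  (i.e., 69/70, 53/70, 17/70, 1/70)
```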
In the broadest sense, p-values indicate the probability of getting a result greater than, less than, or at least as large in absolute value as the one in your sample, if the null hypothesis is true.
In linear regression, p-values are typically calculated based on the t-distribution
Pr(>|t|) = .225: there is a 22.5% chance of seeing a result as extreme (in absolute value) as our sample statistic if the null hypothesis is true
Pr(>t) = .04: there is a 4% chance of seeing a result greater than the sample statistic if the null hypothesis is true
Pr(<t) = .95: there is a 95% chance of seeing a result less than the sample statistic if the null hypothesis is true
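These tail probabilities are direct `pt()` calls in R (the t statistic and degrees of freedom below are made-up illustrations, not taken from a real model):

```r
t_stat <- 1.23  # hypothetical t statistic
df     <- 48    # hypothetical residual degrees of freedom
2 * pt(abs(t_stat), df, lower.tail = FALSE)  # Pr(>|t|), two-sided
pt(t_stat, df, lower.tail = FALSE)           # Pr(>t), upper tail
pt(t_stat, df)                               # Pr(<t), lower tail
```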
A .05 level means that, when the null hypothesis is true, around 1 in 20 results would be false positives
From: McShane, B. B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885-895.
Often misinterpreted, even by experts
They’re continuous, but they tend to be treated as binary and conflated with importance
In large samples they're hard to interpret, because they become infinitesimally small
In repeated sampling, 95% of the 95% confidence intervals will contain the correct value of \(\beta\)
It's misleading to say there's a 95% chance that the population \(\beta\) falls between the CI boundaries. (Frequentist statistics assumes that \(\beta\) is fixed, so it doesn't have a "chance" of doing anything.)
Better to say “I’m 95% certain this range contains the true value”
Pro: p-values and confidence intervals are based on the same basic assumptions, but CIs can be more informative about uncertainty and don't lend themselves to the same dichotomous interpretation
Pro: Automatically scaled to the quantity of interest, so harder to confuse with a measure of substantive importance
Pro: can be used for equivalence testing (we’ll come back to this)
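In R, `confint()` gives these intervals directly; a minimal sketch, assuming a fitted model like the vote-share example later in these notes:

```r
# `states` is a hypothetical data frame (see the vote-share table below)
m <- lm(demshare ~ demshare_2016, data = states)
confint(m, level = 0.95)  # 95% CIs for the intercept and slope
```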
One proposed alternative: S-values, AKA surprisal or Shannon information
\[\text{S-value} = -\log_2(p)\]
Values increase as evidence gets stronger, and the magnitudes are less extreme than tiny p-values:
| P value | S Value |
|---|---|
| 0.9 | 0.15 |
| 0.06 | 4.05 |
| 0.05 | 4.32 |
| 0.01 | 6.64 |
| 0.001 | 10 |
| 0.000000001 | 30 |
| 1e-100 | 332 |
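Since the S-value is just a log transform, reproducing the table is a one-liner in R:

```r
s_value <- function(p) -log2(p)  # surprisal, in bits
s_value(c(0.9, 0.05, 0.001, 1e-100))
# 0.152  4.322  9.966  332.193
```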
Pro: Still based on the same assumptions as a p-value; it's just a transformation of the p-value that might be more intuitive for people.
Pro: easy calculation (for a computer, at least)
Con: Not widely adopted. All of this stuff relies on convention.
Bayesian methods potentially allow us to do the thing we wish we could do with p-values: quantify hypothesis probabilities
The downside is that it sort of requires everyone to adopt a totally different philosophical approach to probability.
Bayes Factors measure the weight of evidence in favor of one hypothesis relative to another.
Wagenmakers (2007) suggests a frequentist approximation of a Bayes Factor using the Bayesian Information Criterion.
Run a model with the variable of interest (model 1)
Run a model without the variable of interest (model 0)
Calculate the BIC for both
Calculate \(\text{BF}_{10} = \frac{\exp(-\text{BIC}_1/2)}{\exp(-\text{BIC}_0/2)} = \exp\left(\frac{\text{BIC}_0 - \text{BIC}_1}{2}\right)\)
Numbers greater than 1 indicate evidence in favor of model 1, numbers less than 1 indicate evidence against model 1.
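A minimal sketch of this approximation in R; the data frame `dat` and variables `y`, `x`, and `z` are placeholders, not from a real example:

```r
m1 <- lm(y ~ x + z, data = dat)       # model 1: with the variable of interest (x)
m0 <- lm(y ~ z, data = dat)           # model 0: without it
bf10 <- exp((BIC(m0) - BIC(m1)) / 2)  # Wagenmakers' BIC approximation
bf10  # > 1: evidence for model 1; < 1: evidence against model 1
```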
R-squared can be used for model comparison, but it doesn't tell you whether a model is a good predictor.
Predicting Democratic vote share using the 2016 share
| state | demshare | demshare_2016 | median_age |
|---|---|---|---|
| Alabama | 0.371 | 0.356 | 39.7 |
| Alaska | 0.447 | 0.416 | 35.6 |
| Arizona | 0.502 | 0.481 | 38.9 |
| Arkansas | 0.358 | 0.357 | 38.8 |
| California | 0.649 | 0.661 | 37.5 |
| Colorado | 0.569 | 0.527 | 37.2 |
\[ F = \frac{(RSS_{\text{restricted}} - RSS_{\text{full}}) / (df_{\text{restricted}} - df_{\text{full}})}{RSS_{\text{full}} / df_{\text{full}}} \]
We can continue to include additional covariates in the model. But the "fair comparison" here is probably the same model minus one covariate (rather than comparing to the null model)
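A sketch of that sequential comparison in R, assuming a `states` data frame like the table above; it produces an ANOVA table like the one below:

```r
m0 <- lm(demshare ~ 1, data = states)                           # null model
m1 <- lm(demshare ~ demshare_2016, data = states)               # + 2016 share
m2 <- lm(demshare ~ demshare_2016 + median_age, data = states)  # + median age
anova(m0, m1, m2)  # each row tests a model against the previous one
```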
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 50 | 0.758 | | | | |
| 49 | 0.00896 | 1 | 0.749 | 4.04e+03 | 5.31e-48 |
| 48 | 0.00889 | 1 | 6.62e-05 | 0.357 | 0.553 |
The F-test requires nested models, but the BIC and AIC allow comparisons across non-nested forms
\[\mathrm{BIC} = n\ln(RSS/n) + k\ln(n)\]
Basic method: set aside some observations, fit the model on the training set, and then predict the held-out test set
For small samples, k-fold cross-validation or LOOCV is preferable: leave out one fold (or a single observation), fit on the rest, predict the held-out data, and repeat until every observation has been predicted. Then average the metric.
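A sketch with the caret package (again assuming the hypothetical `states` data frame); it yields output like that shown below:

```r
library(caret)
fit <- train(demshare ~ demshare_2016, data = states,
             method = "lm",
             trControl = trainControl(method = "LOOCV"))
fit
```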
```
Linear Regression

51 samples
 1 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 50, 50, 50, 50, 50, 50, ...
Resampling results:

  RMSE        Rsquared   MAE
  0.01380351  0.9871893  0.01096933

Tuning parameter 'intercept' was held constant at a value of TRUE
```
…but is this the right metric?
Conventional hypothesis testing "stacks the deck" in favor of finding no relationship, and a failure to reject the null could simply be a function of a small sample size.
Still, we often want to dismiss a claim:
Can citizen pressure campaigns make states more effective?
Do motor voter laws improve turnout?
Are some people actually psychic?
"Two one-sided tests" (TOST)
Step 1: Set a minimum effect size worth caring about, defining equivalence bounds \((-\Delta_l, \Delta_u)\)
Step 2: Conduct one-sided t-tests of:
\(H_{01}: \theta \leq -\Delta_l\) (the true effect \(\theta\) is at or below the lower bound)
\(H_{02}: \theta \geq \Delta_u\) (the true effect is at or above the upper bound)
If both of these tests are rejected, then we can conclude the effect is smaller than the minimum effect size worth caring about.
Eskine (2013) found that people exposed to organic foods became more morally judgmental.
Moery and Calin-Jageman (2016) attempt to replicate this finding
Moery and Calin-Jageman’s replication results look like this:
| Group | N | Mean | SD |
|---|---|---|---|
| Control | 95 | 5.25 | 0.95 |
| Treatment | 89 | 5.22 | 0.83 |
The estimated effect: 5.25 - 5.22 = 0.03
Minimal effect size \(\Delta = 0.43\)
What is the probability of seeing a mean difference greater than or equal to 0.03 if the true difference is -0.43?
What is the probability of seeing a mean difference less than or equal to 0.03 if the true difference is 0.43?
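Those two probabilities are the TOST p-values; a sketch of the hand calculation from the summary statistics, using a simple pooled-degrees-of-freedom approximation:

```r
m1 <- 5.25; s1 <- 0.95; n1 <- 95  # control
m2 <- 5.22; s2 <- 0.83; n2 <- 89  # treatment
delta <- 0.43                     # minimum effect size of interest

d  <- m1 - m2                     # observed difference: 0.03
se <- sqrt(s1^2 / n1 + s2^2 / n2) # standard error of the difference
df <- n1 + n2 - 2                 # simple df approximation

pt((d + delta) / se, df, lower.tail = FALSE)  # P(diff >= 0.03 | true diff = -0.43)
pt((d - delta) / se, df)                      # P(diff <= 0.03 | true diff = +0.43)
# both p-values are well below .05, so we reject both nulls: equivalence
```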
“Hypothesis 2: Increasing the district magnitude from one to seven will not lead to a substantively meaningful change in the effective number of political parties when the effective number of ethnic groups is one.”
Substantively meaningful effect: changing the effective number of parties by 0.62 (where does he get that?)